Visualizing Data with Seaborn (Python Track)

2024-09-11

Introduction

  • What is Seaborn?
  • Why use Seaborn for data visualization?
  • Brief overview of the session

Source Code

Setting Up the Environment

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from matplotlib import rcParams

# Set global font properties to Arial
rcParams.update(
    {
        "font.family": "sans-serif",
        "font.sans-serif": "Arial",
        "pdf.fonttype": 42,  # Embed fonts as TrueType (Type 42) so text stays editable
        "ps.fonttype": 42,
        "text.usetex": False,
        "svg.fonttype": "none",
    }
)


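def standardize_columns(df):
    """Normalize column names and parse the occurrence date, in place."""
    # Trim column names and collapse repeated whitespace inside them
    df.columns = [" ".join(col.strip().split()) for col in df.columns]
    # Basic data cleaning: parse the timestamp column
    df["DATE OF OCCURRENCE"] = pd.to_datetime(df["DATE OF OCCURRENCE"])


# Load the data
df = pd.read_csv("data/Crimes_One_year_prior_to_present_first_1001.csv")
standardize_columns(df)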

Understanding the Dataset

# Display basic information about the dataset
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   CASE#                  1000 non-null   object        
 1   DATE OF OCCURRENCE     1000 non-null   datetime64[ns]
 2   BLOCK                  1000 non-null   object        
 3   IUCR                   1000 non-null   object        
 4   PRIMARY DESCRIPTION    1000 non-null   object        
 5   SECONDARY DESCRIPTION  1000 non-null   object        
 6   LOCATION DESCRIPTION   998 non-null    object        
 7   ARREST                 1000 non-null   object        
 8   DOMESTIC               1000 non-null   object        
 9   BEAT                   1000 non-null   int64         
 10  WARD                   1000 non-null   int64         
 11  FBI CD                 1000 non-null   object        
 12  X COORDINATE           999 non-null    float64       
 13  Y COORDINATE           999 non-null    float64       
 14  LATITUDE               999 non-null    float64       
 15  LONGITUDE              999 non-null    float64       
 16  LOCATION               999 non-null    object        
dtypes: datetime64[ns](1), float64(4), int64(2), object(10)
memory usage: 132.9+ KB
None

Data source: https://data.cityofchicago.org/Public-Safety/Crimes-Map
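
The info() output above shows a few nulls in LOCATION DESCRIPTION and in the coordinate columns. A minimal sketch of how one might count them and drop incomplete rows before a geographic plot; the `toy` frame and its values are hypothetical stand-ins for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the crime DataFrame
toy = pd.DataFrame(
    {
        "LOCATION DESCRIPTION": ["STREET", None, "ALLEY"],
        "LATITUDE": [41.93, 41.88, np.nan],
        "LONGITUDE": [-87.72, -87.62, np.nan],
    }
)

# Count missing values per column and show only the affected columns
missing = toy.isna().sum()
print(missing[missing > 0])

# Keep only rows that have both coordinates, e.g. before a scatter plot
geo = toy.dropna(subset=["LATITUDE", "LONGITUDE"])
print(len(geo), "rows with coordinates")
```

Seaborn generally skips missing values when plotting, so this step is mostly about knowing how many rows a plot silently leaves out.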

Understanding the Dataset

from IPython.display import display, HTML

# Convert the first 10 rows of the DataFrame to HTML
df_html = df.head(10).to_html(classes="dataframe", index=False)

# Wrap the table in a div with scrolling
html_content = f"""
<div style="max-height: 400px; overflow: auto;">
    <style>
        .dataframe {{
            font-size: 12px;
            border-collapse: collapse;
            width: 100%;
        }}
        .dataframe th, .dataframe td {{
            border: 1px solid #ddd;
            padding: 8px;
            text-align: left;
        }}
        .dataframe tr:nth-child(even) {{background-color: #f2f2f2;}}
        .dataframe th {{
            background-color: #4CAF50;
            color: white;
            position: sticky;
            top: 0;
        }}
    </style>
    {df_html}
</div>
"""

display(HTML(html_content))

Understanding the Dataset

CASE# | DATE OF OCCURRENCE | BLOCK | IUCR | PRIMARY DESCRIPTION | SECONDARY DESCRIPTION | LOCATION DESCRIPTION | ARREST | DOMESTIC | BEAT | WARD | FBI CD | X COORDINATE | Y COORDINATE | LATITUDE | LONGITUDE | LOCATION
JH117298 | 2024-01-16 01:00:00 | 038XX W DIVERSEY AVE | 0810 | THEFT | OVER $500 | STREET | N | N | 2524 | 35 | 06 | 1150337.0 | 1918345.0 | 41.931844 | -87.722951 | (41.931843966, -87.722950868)
JG561057 | 2023-12-31 16:30:00 | 004XX N WABASH AVE | 0460 | BATTERY | SIMPLE | STREET | N | N | 1834 | 42 | 08B | 1176592.0 | 1902931.0 | 41.888994 | -87.626935 | (41.888993854, -87.626934833)
JG512939 | 2023-11-21 14:28:00 | 056XX S ELIZABETH ST | 143A | WEAPONS VIOLATION | UNLAWFUL POSSESSION - HANDGUN | RESIDENCE - YARD (FRONT / BACK) | N | N | 713 | 16 | 15 | 1168951.0 | 1867382.0 | 41.791613 | -87.656025 | (41.791613294, -87.656024853)
JG496628 | 2023-11-08 15:27:00 | 059XX N GLENWOOD AVE | 0460 | BATTERY | SIMPLE | SCHOOL - PUBLIC BUILDING | Y | N | 2013 | 48 | 08B | 1165910.0 | 1939379.0 | 41.989244 | -87.665120 | (41.989243623, -87.665119726)
JG512358 | 2023-11-21 02:12:00 | 049XX W SCHUBERT AVE | 1320 | CRIMINAL DAMAGE | TO VEHICLE | ALLEY | N | N | 2521 | 31 | 14 | 1143030.0 | 1917505.0 | 41.929679 | -87.749824 | (41.929678531, -87.749824286)
JG496031 | 2023-11-08 07:00:00 | 075XX S STONY ISLAND AVE | 0460 | BATTERY | SIMPLE | HOSPITAL BUILDING / GROUNDS | N | N | 411 | 8 | 08B | 1188234.0 | 1855158.0 | 41.757631 | -87.585708 | (41.757630995, -87.585708249)
JG512359 | 2023-11-21 01:38:00 | 002XX W 37TH PL | 0560 | ASSAULT | SIMPLE | APARTMENT | Y | Y | 915 | 3 | 08A | 1175127.0 | 1880095.0 | 41.826363 | -87.632999 | (41.826363218, -87.632998863)
JG444480 | 2023-09-30 01:50:00 | 034XX W FLOURNOY ST | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | VEHICLE NON-COMMERCIAL | N | Y | 1133 | 24 | 08B | 1153640.0 | 1896821.0 | 41.872715 | -87.711386 | (41.872714939, -87.711386229)
JG519372 | 2023-11-17 16:30:00 | 004XX N PINE AVE | 031A | ROBBERY | ARMED - HANDGUN | SIDEWALK | N | N | 1523 | 37 | 03 | 1139464.0 | 1902305.0 | 41.888034 | -87.763300 | (41.888033817, -87.763299736)
JG483237 | 2023-10-28 19:30:00 | 068XX S DR MARTIN LUTHER KING JR DR | 0910 | MOTOR VEHICLE THEFT | AUTOMOBILE | VEHICLE NON-COMMERCIAL | N | N | 322 | 6 | 07 | 1180091.0 | 1860009.0 | 41.771133 | -87.615403 | (41.771132967, -87.615402602)

Introduction to Seaborn Plot Types

  • Overview of common Seaborn plot types
  • When to use each plot type
  • Basic syntax and structure
  • Complex plot types
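
As a rough illustration of the basic syntax: most Seaborn functions take a DataFrame via `data=` and column names via `x=`, `y=`, and `hue=`. Axes-level functions draw on a single matplotlib Axes, while figure-level functions build their own figure and return a grid object. The `toy` frame below is a hypothetical stand-in for the crime data:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for the crime DataFrame
toy = pd.DataFrame(
    {
        "HOUR": [1, 1, 14, 16, 19, 23],
        "WARD": [35, 42, 16, 48, 31, 8],
        "ARREST": ["N", "N", "N", "Y", "N", "Y"],
    }
)

# Axes-level function: draws onto one matplotlib Axes and returns it
ax = sns.scatterplot(data=toy, x="HOUR", y="WARD", hue="ARREST")
ax.set_title("Axes-level: scatterplot")

# Figure-level function: creates its own figure and returns a FacetGrid
grid = sns.relplot(data=toy, x="HOUR", y="WARD", col="ARREST", height=3)
grid.set_axis_labels("Hour of Day", "Ward")

# Call plt.show() here to display both figures in a script
```

The distinction matters for sizing: axes-level plots follow `plt.figure(figsize=...)`, whereas figure-level plots are sized with their own `height` and `aspect` arguments.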

Categorical Plots: Bar Plot

sns.countplot(data=df, y='PRIMARY DESCRIPTION', order=df['PRIMARY DESCRIPTION'].value_counts().index[:10])
plt.title('Top 10 Crime Types')
plt.show()

Further Exploration

from wordcloud import WordCloud
import matplotlib.pyplot as plt


# Combine all secondary descriptions into a single string
text = " ".join(df["SECONDARY DESCRIPTION"].dropna())

# Create and generate a word cloud image
wordcloud = WordCloud(
    width=800, height=400, background_color="white", min_font_size=10
).generate(text)

# Display the generated image
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Categorical Plots: Box Plot

df["DAY_OF_WEEK"] = df["DATE OF OCCURRENCE"].dt.day_name()
plt.figure(figsize=(12, 5))
sns.boxplot(data=df, x="DAY_OF_WEEK", y="DATE OF OCCURRENCE").set_ylabel("Date")
plt.title("Distribution of Crimes by Day of the Week")
plt.show()

Categorical Plots: Violin Plot

plt.figure(figsize=(12, 5))
sns.violinplot(data=df, x="DAY_OF_WEEK", y="DATE OF OCCURRENCE").set_ylabel("Date")
plt.show()

Categorical Plots: Boxen (Letter-Value) Plot

plt.figure(figsize=(12, 5))
sns.boxenplot(data=df, x="DAY_OF_WEEK", y="DATE OF OCCURRENCE").set_ylabel("Date")
plt.show()

Distribution Plots: Histogram

df["HOUR"] = df["DATE OF OCCURRENCE"].dt.hour
plt.figure(figsize=(12, 5))
sns.histplot(data=df, x="HOUR", bins=24, kde=True)
plt.title("Distribution of Crimes by Hour of the Day")
plt.show()

Distribution Plots: KDE Plot

plt.figure(figsize=(12, 6))
sns.kdeplot(data=df, x="HOUR", hue="PRIMARY DESCRIPTION", common_norm=False)
plt.title("Distribution of Different Crime Types by Hour")
plt.show()

Relational Plots: Scatter Plot

plt.figure(figsize=(12, 8))
sns.scatterplot(data=df, x="LONGITUDE", y="LATITUDE", hue="PRIMARY DESCRIPTION")
plt.title("Geographical Distribution of Crimes")
plt.show()

Relational Plots: Line Plot

# Group by calendar date (not the exact timestamp) so each point is a daily count
crime_counts = (
    df.groupby(df["DATE OF OCCURRENCE"].dt.date).size().reset_index(name="COUNT")
)
plt.figure(figsize=(10, 5))
sns.lineplot(data=crime_counts, x="DATE OF OCCURRENCE", y="COUNT")
plt.title("Crime Trends Over Time")
plt.xticks(rotation=45)
plt.show()

Advanced Customization

plt.figure(figsize=(14, 6))
sns.set_style("whitegrid")
sns.set_palette("deep")

g = sns.countplot(
    data=df,
    y="PRIMARY DESCRIPTION",
    order=df["PRIMARY DESCRIPTION"].value_counts().index[:10],
)

g.set_title("Top 10 Crime Types", fontsize=20)
g.set_xlabel("Count", fontsize=14)
g.set_ylabel("Crime Type", fontsize=14)

for i, v in enumerate(df["PRIMARY DESCRIPTION"].value_counts()[:10]):
    g.text(v + 3, i, str(v), color="black", va="center")

plt.tight_layout()
plt.show()

Summary

Heatmap

  • Useful for visualizing correlation between variables
  • Can show patterns and relationships in complex datasets

# Select numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Compute correlation matrix
corr_matrix = df[numeric_cols].corr()

# Create heatmap
plt.figure(figsize=(6, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", linewidths=0.5)
plt.xticks(rotation=45, ha="right")
plt.show()

Customized Heatmap

# Select numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Compute correlation matrix
corr_matrix = df[numeric_cols].corr()

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Set up the matplotlib figure
plt.figure(figsize=(8, 8))

# Create heatmap with the upper triangle masked out (only the lower triangle is drawn)
sns.heatmap(
    corr_matrix,
    mask=mask,
    annot=True,
    cmap="coolwarm",
    vmin=-1,
    vmax=1,
    center=0,
    square=True,
    linewidths=0.5,
    cbar_kws={"shrink": 0.8},
    fmt=".2f",
)
plt.xticks(rotation=45, ha="right")
plt.show()

Pair Plot

  • Useful for exploring relationships between multiple variables
  • Creates a grid of scatter plots for each pair of variables

Pair Plot

# Select relevant columns for the pair plot
cols_to_plot = ["X COORDINATE", "Y COORDINATE", "LATITUDE", "LONGITUDE"]

# Add hour of day
df["HOUR"] = pd.to_datetime(df["DATE OF OCCURRENCE"]).dt.hour

# Create the pair plot. pairplot builds its own figure, so a preceding
# plt.figure() call would only leave an empty figure behind; use the
# `height` argument if the panels need resizing.
pairplot = sns.pairplot(
    df[cols_to_plot + ["HOUR", "PRIMARY DESCRIPTION"]],
    hue="PRIMARY DESCRIPTION",
    palette="viridis",
    plot_kws={"alpha": 0.6},
    diag_kind="kde",
)
plt.tight_layout()
plt.show()

Regression Plot

  • Visualizes the relationship between two variables
  • Includes a linear regression line and confidence interval

g = sns.lmplot(
    data=df,
    x="BEAT",
    y="WARD",
    col="ARREST",
    row="DOMESTIC",
    height=3,
    aspect=2,
    facet_kws=dict(sharex=False, sharey=False),
    scatter_kws={"alpha": 0.5},
)
# plt.title would label only the last facet; set a figure-level title instead
g.figure.suptitle("Regression Plot: Beat vs Ward by Arrest and Domestic Status", y=1.02)
plt.show()

Advanced Seaborn: FacetGrid

  • Demonstrates how to create multiple plots in a grid
  • Useful for comparing distributions across categories

# Create a FacetGrid (it builds its own figure; size it with height and aspect)
g = sns.FacetGrid(df, col="PRIMARY DESCRIPTION", col_wrap=3, height=4, aspect=1.5)

# Map a histogram to each subplot
g.map(plt.hist, "HOUR", bins=24)

# Customize the plot
g.set_axis_labels("Hour of Day", "Count")
g.set_titles("{col_name}")
g.figure.suptitle("Distribution of Crimes by Hour for Different Crime Types", y=1.02)
g.tight_layout()
plt.show()

Try new features

import seaborn.objects as so

# Create the plot (axis labels match the plotted columns)
(
    so.Plot(df, x="BEAT", y="WARD")
    .add(so.Dot())
    .label(x="Beat", y="Ward")
    .layout(size=(12, 8))
    .show()
)

Best Practices and Tips

  • Choosing the right plot for your data
  • Pay attention to color choices and accessibility
  • Avoiding common pitfalls
  • Consider the story your visualization is telling
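
On color choices and accessibility, one option is to switch to Seaborn's built-in "colorblind" palette and set a consistent theme once, before any plotting. A small sketch; the style and font scale are illustrative choices, not recommendations from the session:

```python
import seaborn as sns

# "colorblind" is one of seaborn's built-in qualitative palettes
palette = sns.color_palette("colorblind", n_colors=5)
print(palette.as_hex())

# Apply a consistent look to every plot drawn afterwards
sns.set_theme(style="whitegrid", palette="colorblind", font_scale=1.2)
```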

Q&A

  • Q&A session

Additional Resources

  • https://seaborn.pydata.org
  • https://www.data-to-viz.com/
  • https://data.cityofchicago.org/Public-Safety/Crimes-One-year-prior-to-present/x2n5-8w5q/data
  • https://quarto.org/docs/presentations/revealjs/